MASSACHUSETTS INSTITUTE OF TECHNOLOGY 


PROJECT MAC 


Artificial Intelligence: 

Memo. No. 153. January 1968.


REEX

A CONVERT PROGRAM TO

REALIZE THE McNAUGHTON-YAMADA ANALYSIS ALGORITHM


Harold V. McIntosh *


* ESCUELA SUPERIOR DE FISICA Y MATEMATICAS
INSTITUTO POLITECNICO NACIONAL
MEXICO 14 D.F., MEXICO


ABSTRACT 


REEX is a CONVERT program, realized in the CTSS LISP of
Project MAC, for carrying out the McNaughton-Yamada analysis
algorithm, whereby a regular expression is found describing the
words accepted by a finite state machine whose transition table is
given. Unmodified, the algorithm will produce 4^n terms representing
an n-state machine. This number could be reduced by eliminating
duplicate calculations and rejecting on a high level expressions
corresponding to no possible path in the state diagram. The
remaining expressions present a serious simplification problem, since
empty expressions and null words are generated liberally by the
algorithm. REEX treats only the third of these problems, and at
that makes simplifications mainly oriented toward removing null
words, empty expressions, and expressions of the form X ∪ X*, A ∪ B*A,
and others closely similar. REEX is primarily useful to understand
the algorithm, but hardly usable for machines with six or more states.

Since regular expressions form such a convenient characterization
of the words accepted by a finite state machine it is desirable to have
a means of deducing the descriptive regular expression from the
transition table of the machine. The first such algorithm was described
by McNaughton and Yamada, and for some time was apparently the only such
algorithm known. While it is conceptually quite simple, its application
in practice can lead to grossly cumbersome expressions. Nevertheless
its mechanization is instructive, and is the object of the CONVERT program
REEX.

We will use the following example to illustrate our discussion.

The states of the machine to be analysed are supposed to be numbered
1 through n. We then define regular expressions \( \alpha_{ij}^{k} \)
recursively as follows,


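\[
\alpha_{ii}^{0} = \{\sigma \in \Sigma \mid M(i,\sigma) = i\} \cup \lambda
\]

\[
\alpha_{ij}^{0} = \{\sigma \in \Sigma \mid M(i,\sigma) = j\}, \qquad i \neq j
\]

\[
\alpha_{ij}^{k} = \alpha_{ij}^{k-1} \cup \alpha_{ik}^{k-1} \left( \alpha_{kk}^{k-1} \right)^{*} \alpha_{kj}^{k-1}
\]

where \( \lambda \) denotes the null word.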

By examining these definitions it can be seen that \( \alpha_{ij}^{k} \) is a
regular expression representing all the words corresponding to transitions
from state i to state j without passing through states numbered greater
than k. Transitions from state i to state j without any such restriction
are then given by the expressions \( \alpha_{ij}^{n} \), and the regular expression
representing the machine is the union of such expressions where i is the
initial state and j belongs to the accepting set.
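
The recursion can be transcribed almost literally, which gives a way of checking it on small machines. The following Python sketch is not part of REEX. The tuple representation (prefix operators UN, CN, IT, the string "$" for the null word, a UN with no arguments for the empty set), the names alpha and words, and the three-state table are assumptions made for illustration; in particular state 3 is assumed to be a dead state, which is consistent with the worked example later in the text but is not recoverable from it.

```python
from itertools import product

def alpha(i, j, k, table):
    """Unsimplified expression for the words taking state i to state j
    through intermediate states numbered at most k (states are 1..n)."""
    if k == 0:
        letters = sorted(l for (s, l), t in table.items() if s == i and t == j)
        return ("UN", "$", *letters) if i == j else ("UN", *letters)
    return ("UN", alpha(i, j, k - 1, table),
            ("CN", alpha(i, k, k - 1, table),
                   ("IT", alpha(k, k, k - 1, table)),
                   alpha(k, j, k - 1, table)))

def words(e, n):
    """All words of length <= n denoted by the expression e."""
    if e == "$":
        return {""}
    if isinstance(e, str):
        return {e} if len(e) <= n else set()
    op, *args = e
    langs = [words(a, n) for a in args]
    if op == "UN":
        return set().union(*langs)
    if op == "CN":
        out = {""}
        for lang in langs:
            out = {u + v for u in out for v in lang if len(u + v) <= n}
        return out
    # op == "IT": close {""} under concatenation with the body's language
    out, frontier = {""}, {""}
    while frontier:
        frontier = {u + v for u in frontier for v in langs[0]
                    if len(u + v) <= n} - out
        out |= frontier
    return out

# Assumed example machine: (state, letter) -> state, with state 3 a dead state.
table = {(1, "0"): 1, (1, "1"): 2,
         (2, "0"): 2, (2, "1"): 3,
         (3, "0"): 3, (3, "1"): 3}
expression = alpha(1, 2, 3, table)
```

Enumerating words(expression, 5) and comparing with the words containing exactly one 1 confirms, for short words, that the unsimplified expression denotes 0*10*.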

The transcription of this recursive definition into CONVERT presents
no complications. In fact, if we let the index k be the state set, and the
indices i and j be actual states, we may avoid the necessity of
numbering the states. The pattern we are to recognize is then the list
(I J K), and it will be seen in the three forms

      (I I ())

      (I F ())

      (I F (X XXX))

The skeletons which will be substituted in the three cases will be
respectively a set of letters union the null letter, a set of letters,
and the CONVERT form of the recursive formula. We need a means of
extracting the letters causing a given transition, for which we introduce
the patterns (U*) or (T*).

      (U*)   PAT   ((=OR= (=== (I L I) U*) (===)))

      (T*)   PAT   ((=OR= (=== (I L F) T*) (===)))

where

      L      BUV   =ATO=

These definitions assume that the transition table TT is presented
in the form (... (I L F) ...) where I is the state from which the letter
L causes a transition to F; i.e. M(I,L) = F. (U*) and (T*) are then
collection patterns using the bucket variable L to find all the letters
causing transitions respectively from I to I, or I to F.

If we use the symbol $, available in the character set of CTSS LISP,
for the null word, we can now write the McNaughton-Yamada algorithm in
CONVERT form. The rules are

      ((I I ()) (=WHEN= TT (U*) (UNO $ (=UNON= L))))

      ((I F ()) (=WHEN= TT (T*) (UNO (=UNON= L))))

      ((I F (X XXX)) (UNO (=REPT= (I F (XXX)))
                          (CON (=REPT= (I X (XXX)))
                               (ITR (=REPT= (X X (XXX))))
                               (=REPT= (X F (XXX))))))

The symbols UNO, CON, and ITR stand respectively for union, concatenation,
and iteration, the "polish" form of the connectives which form regular
expressions. In actual practice, the state set is not given and must be
deduced from the transition table. This is done with a bucket variable
and collecting skeleton,

      S      BUV   =ATO=

      (S*)   PAT   ((=OR= (=== (S == S) S*) (===)))

which is applied to the transition table. Hence our actual CONVERT program
contains the rule

      ((I F (S*)) (UNO (=ITER= J F (=REPT= (I J (=UNON= S)) *1 (...)))))

wherein *1 is the rule set displayed above. In this way we take
account of all the states in the accepting set, and obtain the state
set implicitly from the bucket variable S. It follows that (REEX I F T)
has as arguments the initial state, the set of accepting states, and the
transition table as a list of triplets of the form (I L F).
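
The overall shape of the program, with the state set standing in for the index k, can be sketched as follows. This is a Python paraphrase under stated assumptions, not the CONVERT program itself: the names expr, state_set and reex are hypothetical, expressions are nested tuples with prefix operators UN, CN and IT, "$" is the null word, and no simplification of any kind is attempted.

```python
def expr(i, f, states, tt):
    """Expression for the words leading from i to f whose intermediate
    states are all drawn from the list `states` (the memo's (I F K))."""
    if not states:
        # Terminal rules: collect every letter of a triple (i, letter, f).
        letters = [l for (s, l, t) in tt if s == i and t == f]
        return ("UN", "$", *letters) if i == f else ("UN", *letters)
    x, rest = states[0], states[1:]          # the pattern (X XXX)
    return ("UN", expr(i, f, rest, tt),
            ("CN", expr(i, x, rest, tt),
                   ("IT", expr(x, x, rest, tt)),
                   expr(x, f, rest, tt)))

def state_set(tt):
    """States in order of first appearance in the triples (the bucket S)."""
    seen = []
    for (s, _, t) in tt:
        for q in (s, t):
            if q not in seen:
                seen.append(q)
    return seen

def reex(initial, accepting, tt):
    """Union, over the accepting states j, of the expressions from
    `initial` to j, using every state of the table as an intermediate."""
    states = state_set(tt)
    return ("UN", *[expr(initial, j, states, tt) for j in accepting])
```

Because the states are peeled off the list one at a time, the order in which they first appear in the table plays the role of the numbering, so listing the triples differently yields a different, though equivalent, expression.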

If we now set out to calculate some examples we begin to find
that the algorithm is not really very satisfactory, mathematically
correct though it may be. The basic problem is that the terminal condition
very often leads to an empty set. Consequently if it occurs as a term
in the concatenation, the concatenated expression will likewise be an
empty set. However, this is not obvious from the expression which the
algorithm produces, and some simplification must be made. We run the
risk that we may calculate all three terms of the concatenation before
discovering that one of these renders the result trivial.

A second problem is that a very great deal of duplicate calculation
can occur. In other words, the same subexpression may arise from a
variety of histories, and be calculated anew each time. This is the more
time consuming, the higher the level on which it occurs.

A third flaw lies in the fact that the method is prone to produce
redundant expressions; that is, such things as the union of X and $
concatenated with X*, in the simple case, or 110* union 0*110*, to cite a
slightly more complicated example. If all such redundancies were of a
simple nature, they could be edited away, but unfortunately they become
more and more subtle as the number of states of the machine increases.
They arise from otherwise identical paths which do or do not include
certain loops or branches.

The order in which the states are listed not only may lead to
alternate regular expressions representing the machine, but sometimes
can lead to vastly different amounts of calculation even when simplifying
and precautionary techniques are included. McNaughton and Yamada recommend
giving high numbers to heavily trafficked states, to increase the number
of null subexpressions which can be recognized on sight. In this regard
it is certainly profitable to find which pairs of points have no path
whatsoever connecting them, for if there is no such path there will be
none passing through designated intermediate states either, and one can
write the empty expression at once without proceeding through the recursion.
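
Finding the pairs of states with no connecting path is a transitive closure of the arrows of the diagram. The sketch below, in Python, is one simple way to compute it and is not something REEX does; the name reachable_pairs and the example table (with state 3 assumed to be a dead state) are illustrative assumptions.

```python
def reachable_pairs(tt):
    """Set of pairs (i, j) such that some nonempty sequence of arrows
    leads from state i to state j; tt is a list of triples (i, letter, j)."""
    reach = {(s, t) for (s, _, t) in tt}
    changed = True
    while changed:
        changed = False
        for (a, b) in list(reach):
            for (c, d) in list(reach):
                # Join a path a -> b with a path b -> d.
                if b == c and (a, d) not in reach:
                    reach.add((a, d))
                    changed = True
    return reach

# Assumed three-state example; state 3 is taken to be a dead state.
tt = [(1, "0", 1), (1, "1", 2), (2, "0", 2),
      (2, "1", 3), (3, "0", 3), (3, "1", 3)]
paths = reachable_pairs(tt)
```

For distinct i and j, any index (i j k) whose pair (i, j) is absent from the result can be written as the empty expression immediately. When i = j the terminal rule always contributes the null word, so that case is never empty.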


To get an idea of how impressively expansive the algorithm is,
we have to see that at each step in the recursion we generate four new
regular expressions, bound together by various operators. Hence after
n steps, when the recursion terminates, we have 4^n expressions; a truly
exponential growth. Moreover this number is to be multiplied by the number
of accepting states.
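
The count of terms, and the extent of the duplication among them, can be tabulated without building any expressions. The following Python sketch is an illustration under our own assumptions (the function names are hypothetical, and states are numbered 1 through n as in the definition); it counts index triples only and says nothing about the size of the expression finally written out.

```python
def count_terms(i, j, k):
    """Number of terminal (k = 0) terms reached by the unmodified recursion."""
    if k == 0:
        return 1
    return (count_terms(i, j, k - 1) + count_terms(i, k, k - 1)
            + count_terms(k, k, k - 1) + count_terms(k, j, k - 1))

def distinct_indices(i, j, k):
    """Set of distinct index triples (i, j, k) the recursion ever asks for."""
    seen = set()

    def visit(i, j, k):
        if (i, j, k) in seen:
            return                      # already calculated once
        seen.add((i, j, k))
        if k > 0:
            for a, b in ((i, j), (i, k), (k, k), (k, j)):
                visit(a, b, k - 1)

    visit(i, j, k)
    return seen
```

For the index (1 2 3) the recursion reaches 4^3 = 64 terminal terms, while only 20 distinct triples occur among all levels. A table of results indexed by triple therefore avoids most of the recomputation, although the expression written out in full remains large.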

In the example we have cited, the regular expression corresponding
to initial state 1 and accepting state 2 is clearly 0*10*. If we list only
subscripts, the expression we need is (i j k) = (1 2 3). Expanding, we
find


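[Figure: expansion of the index 123 into its constituent indices, level by
level, with repeated indices lined out and the index 322 circled.]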



The indices which have been lined out are those which are repeated,
so that their calculation an additional time is redundant. Moreover, the
circled index 322 is ∅ on the grounds that no arrow of any sort runs from
state 3 to state 2. Hence the concatenation of the last three terms will
also be ∅, and only the first expression need be pursued, an observation
which would immediately reduce the calculation to 1/4.







Even if that simplification were not made, half the terms on the
third level and over half those on the fourth level are redundant,
reducing the calculation to 1/4. Both simplifications are reasonably
typical of more complex expressions. For the moment let us use the
simplification afforded by 322 = ∅ to write

      123 = 122

          = 121 ∪ 121(221)*221

now,

      121 = 120 ∪ 110(110)*120

          = 1 ∪ (0 ∪ $)(0 ∪ $)*1

and

      221 = 220 ∪ 210(110)*120

          = (0 ∪ $) ∪ ∅(0 ∪ $)*1

Here we see on the lowest level quite simple expressions written
in a very cumbersome way. For example, 121 simplifies to 0*1, while
221 is (0 ∪ $). Thus

      123 = 122 = 0*1 ∪ 0*1(0 ∪ $)*(0 ∪ $)

                = 0*10*

Thus the higher levels continue to contribute clumsiness, even though
we finally obtain the obvious result.

Our present program makes no attempt to avoid duplicate calculations,
nor to exclude those which are destined to produce ∅ for lack of any
possible paths. One would think it a small sacrifice to reserve an array
of elements to retain this information, since otherwise the calculation
will be far too time-consuming to treat machines with even half a dozen
elements. Even so, such measures seem to be destined to lower the rate
of growth only to 2^n rather than 4^n.

However, we have studied to a slight extent the third problem, of
simplifying the expressions produced by the algorithm. It was clear
from examining results that some very simple redundancies were accounting
for a substantial fraction of the complexity in the final result. The
technique is to make the regular expression operators UNO, CON, and ITR
into functions responsible for simplifying their arguments.

Let us review them one by one.

      CON    REPT  (
         ((=== () ===) ())
         ((X) X)
         ((XXX $ YYY) (=REPT= (XXX YYY)))
         ((XXX (CN YYY) ZZZ) (=REPT= (XXX YYY ZZZ)))
         ((XXX (IT X) (UN $ X) YYY) (=REPT= (XXX (IT X) YYY)))
         ((XXX (UN $ X) (IT X) YYY) (=REPT= (XXX (IT X) YYY)))
         (== (CN =SAME=)) )

The significance of these simplifications is, line by line:

      A concatenation involving the empty set is empty

      No sign of operation is written to concatenate one element

      The null word need not be written explicitly

      Concatenation is associative

      ($ ∪ X)·X* = X*·($ ∪ X) = X*

      Otherwise prefix notation is used with the symbol CN.
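
The effect of these rules can be imitated by a single simplifying constructor. The following Python sketch is a paraphrase under our own assumptions, not a transcription of the CONVERT rule set: expressions are nested tuples with prefix operators, () is the empty set, "$" is the null word, and the helper name absorbs is hypothetical. Returning the null word for a concatenation whose arguments have all been dropped is a choice made here; the rules above do not cover that case.

```python
EMPTY, NULL = (), "$"        # the empty set and the null word

def absorbs(star, opt):
    """True when star is (IT X) and opt is (UN $ X) for the same X."""
    return (isinstance(star, tuple) and len(star) == 2 and star[0] == "IT"
            and opt == ("UN", NULL, star[1]))

def con(*args):
    """Concatenate the arguments, applying the simplifications listed above."""
    flat = []
    for a in args:                        # concatenation is associative
        if isinstance(a, tuple) and a[:1] == ("CN",):
            flat.extend(a[1:])
        else:
            flat.append(a)
    if EMPTY in flat:                     # an empty factor empties the product
        return EMPTY
    flat = [a for a in flat if a != NULL] # the null word is not written
    out = []
    for a in flat:
        if out and absorbs(out[-1], a):   # X* ($ u X) = X*
            continue
        if out and absorbs(a, out[-1]):   # ($ u X) X* = X*
            out[-1] = a
            continue
        out.append(a)
    if not out:
        return NULL                       # assumption: nothing left means $
    if len(out) == 1:
        return out[0]                     # no operator for a single element
    return ("CN", *out)
```

As in the rule set, only the orderings ($ ∪ X)·X* and X*·($ ∪ X) with the null word written first are recognized, so the constructor relies on the union having been put in that form beforehand.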

There is already apparent in the simplification ($ ∪ X)·X* the fact
that there are a great many equivalent forms which it is a nuisance to have
to list separately; here we have used two rules for ($ ∪ X)·X* and
X*·($ ∪ X), however there should be two more for (X ∪ $)·X* and X*·(X ∪ $).
The unordered variables mode of CONVERT is helpful in such situations, and
is used in the simplification of the union. We have


      I      UNO   (X =I=)
      J      UNO   (X (IT X))
      K      UNO   ((CN WWW) AB*)
      L      UNO   (B*A (CN WWW))
      AB*    PAV   (CN (IT ==) WWW)
      B*A    PAV   (CN WWW (IT ==))
      =I=    PAV   (=OR= (CN X (IT ==)) (CN (IT ==) X))

With these constituents, we have

      UNO    REPT  (
         (== (=REPT= (=UNON= =SAME=) *2 (
            ((XXX () YYY) (=REPT= (XXX YYY)))
            ((XXX (UN YYY) ZZZ) (=REPT= (XXX YYY ZZZ)))
            ((XXX J YYY J ZZZ) (=REPT= (XXX YYY (IT X) ZZZ)))
            ((XXX I YYY I ZZZ) (=REPT= (XXX YYY =I= ZZZ)))
            ((XXX K YYY K ZZZ) (=REPT= (XXX YYY AB* ZZZ)))
            ((XXX L YYY L ZZZ) (=REPT= (XXX YYY B*A ZZZ)))
            ((X XXX $ YYY) (=REPT= ($ X XXX YYY)))
            ((X) X)
            (== (UN =SAME=))
            ))) )


Again we may make a line-by-line analysis. On entry to the
function UNO, we eliminate obviously repeated arguments. Then

      The null set is deleted from a union

      Union is associative

      If X and X* appear in the union, we retain only X*

      If both X and XY* appear then we retain only XY*

      Likewise if WWW X* and WWW appear

      Or X* WWW and WWW

      In a list of at least one element we place the null
      word first, an assumption made in the simplification
      of the concatenation

      We write no operator for the union of one element

      Otherwise the prefix UN is written in prefix notation.
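
The union rules can likewise be paraphrased as one constructor. The Python sketch below is an approximation under our own assumptions rather than a rendering of the unordered-variable patterns: expressions are nested tuples, () is the empty set, "$" the null word, and the helper names is_star and absorbed_by are hypothetical. It treats a single element and a longer concatenation uniformly when testing for W against W Y* or Y* W.

```python
EMPTY, NULL = (), "$"        # the empty set and the null word

def is_star(e):
    return isinstance(e, tuple) and len(e) == 2 and e[0] == "IT"

def absorbed_by(e):
    """Expressions W already contained in e when e is W Y* or Y* W."""
    bodies = []
    if isinstance(e, tuple) and e[:1] == ("CN",) and len(e) >= 3:
        if is_star(e[-1]):
            bodies.append(e[1:-1])
        if is_star(e[1]):
            bodies.append(e[2:])
    return [b[0] if len(b) == 1 else ("CN", *b) for b in bodies]

def uno(*args):
    """Unite the arguments, applying the simplifications listed above."""
    flat = []
    for a in args:                        # union is associative
        if isinstance(a, tuple) and a[:1] == ("UN",):
            flat.extend(a[1:])
        else:
            flat.append(a)
    items = []
    for a in flat:                        # drop the empty set and repeats
        if a != EMPTY and a not in items:
            items.append(a)

    def redundant(a):
        # a is dropped when a*, a Y*, or Y* a is also present
        return any(b == ("IT", a) or a in absorbed_by(b) for b in items)

    items = [a for a in items if not redundant(a)]
    if NULL in items:                     # the null word goes first
        items.remove(NULL)
        items.insert(0, NULL)
    if not items:
        return EMPTY
    if len(items) == 1:
        return items[0]                   # no operator for a single element
    return ("UN", *items)
```

Applied to the last step of the worked example, the union of 0*1 and 0*10* collapses to 0*10*, since the first term is the second with its trailing iteration removed.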

Finally, the simplifications of an iteration are the following.

      ITR    REPT  (
         ((X) (=REPT= X *3 (
            ((UN XXX $ YYY) (=REPT= (UNO XXX YYY)))
            ($ $)
            (() $)
            (== (IT =SAME=))
            ))) )

There is very little simplification that we have seen fit to do
directly on an iteration. The initial transformation is used to
overcome the fact that CONVERT functions list their arguments, even when
there is only one. We note that $* = ()* = $, otherwise the expression
is left intact.
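
A corresponding constructor for the iteration is short. The Python sketch below is a paraphrase under our own assumptions (nested tuples, () for the empty set, "$" for the null word). Where the rule set hands the remaining terms of a union back to UNO, the sketch only rebuilds the union directly, so it is a simplification of that step and not an equivalent of it.

```python
EMPTY, NULL = (), "$"        # the empty set and the null word

def itr(x):
    """Iterate x, applying the simplifications of the ITR rule set."""
    if x == NULL or x == EMPTY:           # $* = ()* = $
        return NULL
    if isinstance(x, tuple) and x[:1] == ("UN",) and NULL in x[1:]:
        # A null word inside an iterated union contributes nothing.
        rest = [a for a in x[1:] if a != NULL]
        if not rest:
            return NULL
        x = rest[0] if len(rest) == 1 else ("UN", *rest)
    return ("IT", x)                      # otherwise left intact
```

For instance, the iteration (0 ∪ $)* of the worked example becomes 0*, which is what allows the concatenation rule for ($ ∪ X)·X* to apply afterwards.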

To see how effective these rules are we could consider some
examples.


reex (o (i ii) ((o 1 ii) (o 0 i) (i 1 ii) (i 0 iv)
                (ii 0 i) (ii 1 iv) (iv 0 iv) (iv 1 iv)))

(UN 0 (CN 1 0) (CN (UN 0 (CN 1 0)) (IT (CN 1 0)) 1 0) 1 (CN (UN
0 (CN 1 0)) (IT (CN 1 0)) 1))

      0 ∪ 10 ∪ [(0 ∪ 10)(10)*10] ∪ 1 ∪ [(0 ∪ 10)(10)*1]


reex (o (i ii) ((o 0 iv) (o 1 i) (i 0 ii) (i 1 iv)
                (ii 0 iv) (ii 1 i) (iv 0 iv) (iv 1 iv)))

(UN 1 (CN 1 0 (IT (CN 1 0)) 1) (CN 1 0 (IT (CN 1 0))))

      1 ∪ (10(10)*1) ∪ (10(10)*)


As may be seen from the examples, the rules given succeed in
eliminating almost all of the complexity due to redundant empty
expressions and null words. Nevertheless, they are reasonably ad hoc,
and do not eliminate more subtle types of redundancy. The subject might
be worth pursuing further to test one's understanding of the simplification
process, but a basic fault of the method is that it generates such
cumbersome and so numerous expressions initially. Fortunately there are
more amenable techniques available to form the regular expression which
corresponds to a machine or transition system, principally the method of
writing a series of regular expression equations for the states and
solving them simultaneously.

DEFINE ((
(REEX (LAMBDA (I FF TT) (CONVERT

   (CONS (QUOTE TT) (CONS (QUOTE EXPR) (CONS TT (QUOTE (

      L      BUV   =ATO=
      S      BUV   =ATO=
      =I=    PAV   (=OR= (CN X (IT ==)) (CN (IT ==) X))
      I      UNO   (X =I=)
      J      UNO   (X (IT X))
      K      UNO   ((CN WWW) AB*)
      L      UNO   (B*A (CN WWW))
      AB*    PAV   (CN (IT ==) WWW)
      B*A    PAV   (CN WWW (IT ==))
      (U*)   PAT   ((=OR= (=== (I L I) U*) (===)))
      (T*)   PAT   ((=OR= (=== (I L F) T*) (===)))
      (S*)   PAT   ((=OR= (=== (S == S) S*) (===)))

      CON    REPT  (
         ((=== () ===) ())
         ((X) X)
         ((XXX $ YYY) (=REPT= (XXX YYY)))
         ((XXX (CN YYY) ZZZ) (=REPT= (XXX YYY ZZZ)))
         ((XXX (IT X) (UN $ X) YYY) (=REPT= (XXX (IT X) YYY)))
         ((XXX (UN $ X) (IT X) YYY) (=REPT= (XXX (IT X) YYY)))
         (== (CN =SAME=))
         )

      UNO    REPT  (
         (== (=REPT= (=UNON= =SAME=) *2 (
            ((XXX () YYY) (=REPT= (XXX YYY)))
            ((XXX (UN YYY) ZZZ) (=REPT= (XXX YYY ZZZ)))
            ((XXX J YYY J ZZZ) (=REPT= (XXX YYY (IT X) ZZZ)))
            ((XXX I YYY I ZZZ) (=REPT= (XXX YYY =I= ZZZ)))
            ((XXX K YYY K ZZZ) (=REPT= (XXX YYY AB* ZZZ)))
            ((XXX L YYY L ZZZ) (=REPT= (XXX YYY B*A ZZZ)))
            ((X XXX $ YYY) (=REPT= ($ X XXX YYY)))
            ((X) X)
            (() ())
            (== (UN =SAME=))
            )))
         )

      ITR    REPT  (
         ((X) (=REPT= X *3 (
            ((UN XXX $ YYY) (=REPT= (UNO XXX YYY)))
            ($ $)
            (() $)
            (== (IT =SAME=))
            )))
         )

      (WWW) (XXX) (YYY) (ZZZ)

      )))))

   (QUOTE (
      I F X
      ))

   (LIST I FF TT)

   (QUOTE (*0 (
      ((I F (S*)) (UNO (=ITER= J F (=REPT= (I J (=UNON= S)) *1 (
         ((I I ()) (=WHEN= TT (U*) (UNO $ (=UNON= L))))
         ((I F ()) (=WHEN= TT (T*) (UNO (=UNON= L))))
         ((I F (X XXX)) (UNO (=REPT= (I F (XXX)))
                             (CON (=REPT= (I X (XXX)))
                                  (ITR (=REPT= (X X (XXX))))
                                  (=REPT= (X F (XXX))))))
         )))))
      )))

   )))
))